library(readr)
library(dplyr)
library(ggplot2)
library(methods)
Let’s look again at the flowers dataset. First, we load the metadata. This is exactly the same as any other dataset in which we pull in a CSV file from GitHub:
flowers <- read_csv("https://statsmaths.github.io/ml_data/flowers_17.csv")
flowers
## # A tibble: 1,360 x 4
## obs_id train_id class class_name
## <chr> <chr> <dbl> <chr>
## 1 id_000362 valid 4 crocus
## 2 id_000506 train 6 tigerlily
## 3 id_000778 valid 9 sunflower
## 4 id_001233 train 15 windflower
## 5 id_000274 train 3 bluebell
## 6 id_001218 train 15 windflower
## 7 id_001280 test NA <NA>
## 8 id_000895 train 11 colts foot
## 9 id_000851 valid 10 daisy
## 10 id_000084 train 1 snowdrop
## # … with 1,350 more rows
Then, we also have to grab the image data itself. To do this, first download the dataset here:
Save it somewhere on your computer and then read it into R:
x64 <- read_rds("image_data/flowers_17_x64.rds")
I again only want to look at the first 10 types of flowers.
x64 <- x64[flowers$class %in% 0:9,,,]
flowers <- flowers[flowers$class %in% 0:9,]
fnames <- flowers$class_name[match(0:9, flowers$class)]
fnames <- factor(fnames, levels = fnames)
If we want to improve our model further beyond dense neural networks, we need to include information beyond just the color of the flower. When we look at the images, our brains also use information about shape and texture. Let’s try to find a way to measure this in the image.
I will start by taking a sample flower image and creating a black and white version of it. A simple way to do this is to average the red, green, and blue pixels.
i <- 50
bw <- (x64[i,,,1] + x64[i,,,2] + x64[i,,,3]) / 3
plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
rasterImage(bw,0,0,1,1)
To detect texture we can take the brightness of each pixel and subtract it from the brightness of the pixel to its lower right. We can do this in a vectorized fashion as such:
edge <- abs(bw[-1,-1] - bw[-nrow(bw),-ncol(bw)])
plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
rasterImage(edge,0,0,1,1)
The resulting image roughly detects edges in the image. Notice that is has only 63-by-63 pixels due to the fact that we cannot compute this measurement on the rightmost or bottommost edges of the plot.
We’ll do this for each image, and save the number of pixels that have an edge value greater than 0.1. You could of course play around with this cutoff, or save a number of different cutoff values. This number will tell us roughly how much of the image consists of edges. A low number indicates a smooth petal and a a high one indicates a grassy texture to the flower.
mean_edge <- rep(0, nrow(flowers))
for (i in seq_len(nrow(flowers))) {
bw <- (x64[i,,,1] + x64[i,,,2] + x64[i,,,3]) / 3
edge <- abs(bw[-1,-1] - bw[-nrow(bw),-ncol(bw)])
mean_edge[i] <- mean(edge > 0.1)
}
A boxplot shows that there are differences between the flowers in this measurement. Crocuses in particular have a lot of edges.
qplot(flowers$class_name, mean_edge, geom = "blank") +
geom_boxplot() +
coord_flip() +
theme_minimal()
Most of the photos have a flower in the middle, but the background may include grass, sky, or other non-related elements. Let’s repeat the edge detector but now only such as the degree of edge-ness only for the middle of the image.
mean_edge_mid <- rep(0, nrow(flowers))
for (i in seq_len(nrow(flowers))) {
bw <- (x64[i,,,1] + x64[i,,,2] + x64[i,,,3]) / 3
edge <- abs(bw[-1,-1] - bw[-nrow(bw),-ncol(bw)])
mean_edge_mid[i] <- mean(edge[20:44,20:44] > 0.1)
}
This shows a clearly differentiation of the flowers. Fritillary have a lot of edges due to their spots in the middle of the photo. Notice that the patterns here are quite different from those in the whole image.
qplot(flowers$class_name, mean_edge_mid, geom = "blank") +
geom_boxplot() +
coord_flip() +
theme_minimal()
We will create a data matrix by putting together the color information with the mean_edge and mean_edge_mid metrics.
color_vals <- c(hsv(1, 0, seq(0, 1, by = 0.2)),
hsv(seq(0, 0.9, by = 0.1), 1, 1))
X_hsv <- matrix(0, ncol = length(color_vals),
nrow = nrow(flowers))
for (i in seq_len(nrow(flowers))) {
red <- as.numeric(x64[i,,,1])
green <- as.numeric(x64[i,,,2])
blue <- as.numeric(x64[i,,,3])
hsv <- t(rgb2hsv(red, green, blue, maxColorValue = 1))
color <- rep("#000000", nrow(hsv))
index <- which(hsv[,2] < 0.2)
color[index] <- hsv(1, 0, round(hsv[index,3] * 5) / 5)
index <- which(hsv[,2] > 0.2 & hsv[,3] > 0.2)
color[index] <- hsv(round(hsv[index,1],1), 1, 1)
X_hsv[i,] <- table(factor(color, levels = color_vals))
}
X_edge <- cbind(X_hsv, mean_edge, mean_edge_mid)
y <- flowers$class
X_train <- X_edge[flowers$train_id == "train",]
X_valid <- X_edge[flowers$train_id == "valid",]
y_train <- y[flowers$train_id == "train"]
y_valid <- y[flowers$train_id == "valid"]
library(glmnet)
model <- cv.glmnet(X_train, y_train, family = "multinomial",
alpha = 0.2)
plot(model)
I’ve included the cross-validation curve because it is a perfect textbook example of what the curve should look like (but rarely does so nicely). The resulting model performs better than the color alone.
pred <- as.numeric(predict(model, newx = X_edge,
type = "class"))
tapply(pred == y, flowers$train_id, mean)
## train valid
## 0.64 0.45
A confusion matrix shows us that only a few flowers are still difficult to differentiate.
table(pred = fnames[pred[flowers$train_id == "valid"] + 1],
y = y[flowers$train_id == "valid"])
## y
## pred 0 1 2 3 4 5 6 7 8 9
## daffodil 7 1 1 1 0 1 0 1 1 5
## snowdrop 3 12 5 1 1 1 3 6 1 0
## lily valley 1 2 12 3 0 0 5 3 1 1
## bluebell 0 0 0 8 1 5 0 0 3 0
## crocus 0 1 0 3 15 2 1 3 0 0
## iris 0 0 1 0 0 7 0 0 2 0
## tigerlily 2 3 1 1 2 0 9 3 2 0
## tulip 5 1 0 0 0 1 1 1 2 3
## fritillary 0 0 0 1 1 2 0 0 8 0
## sunflower 2 0 0 2 0 1 1 3 0 11
The next step would be to figure out what features would help distinguish the “snowdrop”, “daffodil”, and “bluebell” flowers from the others as false positives and negatives from these groups are causing a large portion of the remaining errors.
We know that it is possible to unravel the pixel counts describing an image to turn images into high-dimensional matrix. The columns from this can be put into an elastic net or a neural network to train much as we would with any other numeric dataset. There is, however, quite a lot of information contained in the structure of the image data that we are losing by this approach. Ideally, we would use the information about the fact that some pixels as close to one another and that certain color channels describe the same or neighboring pixels.
The solution is to use convolutional neural networks (CNNs). Despite their name, CNNs are not actually a different kind of neural network but instead refer to a particular kind of layer in a neural network. In a purely pragmatic sense of writing the code, adding convolutional layers into a keras model is very easy. Understanding what they are actually doing, however, can sometimes be difficult. Today we will spend some time trying to build up to convolutional layers before showing how they work on our data.
We will start by considering convolutions created manually outside of a neural network. To start, we need a kernel matrix. Let’s use a kernel with one row and 2 columns:
\[ k = [1, -1] \]
The convolution implied by this kernel takes the pixels in an image and subtracts the value of the pixel value to its immediate right. If the input image has a resolution of 28-by-28, what size and shape with the result of this convolution be? Without modification, it needs to be 28-by-27. The lost column comes because we do not have a way of applying the convolution to pixels in the right-most column of the image. This is usually fixed (keeping the image size constant is useful) by adding a virtual column of 0’s. With this, the result of the convolution is another image of size 28-by-28.
Let’s apply this to a small example. Here we have an input “image” of only 8-by-8. The image seems to show something like a capital letter “L”.
x <- matrix(0, ncol = 8, nrow = 8)
x[2:7, 2] <- c(0.7, 0.7, 0.7, 0.7, 0.7, 0.7)
x[7, 3:5] <- c(0.7, 0.7, 0.7)
x
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 0.0 0.0 0.0 0.0 0 0 0
## [2,] 0 0.7 0.0 0.0 0.0 0 0 0
## [3,] 0 0.7 0.0 0.0 0.0 0 0 0
## [4,] 0 0.7 0.0 0.0 0.0 0 0 0
## [5,] 0 0.7 0.0 0.0 0.0 0 0 0
## [6,] 0 0.7 0.0 0.0 0.0 0 0 0
## [7,] 0 0.7 0.7 0.7 0.7 0 0 0
## [8,] 0 0.0 0.0 0.0 0.0 0 0 0
By constructing the variable x2 as version of x padded with zeros, we can apply the kernel matrix as follows:
z <- matrix(0, ncol = 8, nrow = 8)
x2 <- cbind(x, 0)
kernel <- matrix(c(1, -1), nrow = 1)
for (i in seq_len(nrow(x))) {
for (j in seq_len(ncol(x))) {
z[i, j] <- sum(x2[i, j:(j+1), drop = FALSE] * kernel)
}
}
z
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0.0 0.0 0 0 0.0 0 0 0
## [2,] -0.7 0.7 0 0 0.0 0 0 0
## [3,] -0.7 0.7 0 0 0.0 0 0 0
## [4,] -0.7 0.7 0 0 0.0 0 0 0
## [5,] -0.7 0.7 0 0 0.0 0 0 0
## [6,] -0.7 0.7 0 0 0.0 0 0 0
## [7,] -0.7 0.0 0 0 0.7 0 0 0
## [8,] 0.0 0.0 0 0 0.0 0 0 0
Typically we will apply a ReLU activation to the output of a convolution, so we care only about the positive values. Notice here that almost all of these occur where there is a vertical edge to the capital “L”. There is also one bit of an activation at the end of the bottom part of the letter.
Now, consider a different kernel given by a matrix with one column and two rows:
\[ k = \left[\begin{array}{c} 1 \\ -1 \end{array} \right] \]
To apply this, we take each pixel value and subtract the value immediately below it in the image. By padding the input with an extra row of 0’s, we can get an output image that is the same size as the input. Let’s apply this to the matrix x as well:
y <- matrix(0, ncol = 8, nrow = 8)
x2 <- rbind(x, 0)
kernel <- matrix(c(1, -1), ncol = 1)
for (i in seq_len(nrow(x))) {
for (j in seq_len(ncol(x))) {
y[i, j] <- sum(x2[i:(i+1), j, drop = FALSE] * kernel)
}
}
y
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 -0.7 0.0 0.0 0.0 0 0 0
## [2,] 0 0.0 0.0 0.0 0.0 0 0 0
## [3,] 0 0.0 0.0 0.0 0.0 0 0 0
## [4,] 0 0.0 0.0 0.0 0.0 0 0 0
## [5,] 0 0.0 0.0 0.0 0.0 0 0 0
## [6,] 0 0.0 -0.7 -0.7 -0.7 0 0 0
## [7,] 0 0.7 0.7 0.7 0.7 0 0 0
## [8,] 0 0.0 0.0 0.0 0.0 0 0 0
The positive values from this kernel all occur on the bottom edge of the “L”. So the first kernel detects vertical edges and the second one detects horizontal images. Presumably, knowing where in the image these types of lines are would be very helpful in identifying digits, fashion items, letters, and other types of objects.
If we have a training set here of 1000 images, our input data will have a dimension of
\[ 1000 \times 8 \times 8 \times 1 \]
Because these are 8-by-8 black and white images. What is the dimension after applying these two kernels? Well, for each kernel we have an image of the same size as the original, so we have
\[ 1000 \times 8 \times 8 \times 2 \]
Each of the outputs from the kernels is called a filter. Here we have two filters. You can think of these similarly to the red, green, and blue components of a color image. Each filter tells something useful about a particular part of the original image. Here, it tells whether there is a vertical edge (first component) or horizontal edge (second component).
We could apply another convolution to the output of the first set of convolutions. The kernel here would need to be three dimensional, with a depth of 2, because it has to say what to do with each of the two filters. Likewise, a kernel for an input color image needs three dimensions as well.
Once we have applied convolutions, you an imagine for most applications we do not care exactly where edges or any other features are found. For digit detection we instead just care generally where edges of a particular type are found. With a large number of filters, the data size after a convolution can also quickly become quite large. A solution to this is known as max pooling. We reduce the width and height of each input by a factor of two by dividing the image into 2-by-2 blocks and taking the maximum value of each filter within a block. Here, we apply max pooling, as well as ReLU activations, to the values in the vertical filter:
relu <- function(mat) {
id <- which(mat <= 0)
mat[id] <- 0
mat
}
w <- matrix(0, ncol = 4, nrow = 4)
for (i in seq_len(nrow(x) / 2)) {
for (j in seq_len(ncol(x) / 2)) {
box <- relu(z[(2*i-1):(2*i), (2*j-1):(2*j)])
w[i, j] <- max(box)
}
}
w
## [,1] [,2] [,3] [,4]
## [1,] 0.7 0 0.0 0
## [2,] 0.7 0 0.0 0
## [3,] 0.7 0 0.0 0
## [4,] 0.0 0 0.7 0
In theory, we can pool using other sizes, such as a grid of 3-by-3 points. However, the 2-by-2 is the most common and rarely do we need anything else.
If we apply max pooling, the dataset now has a size of:
\[ 1000 \times 4 \times 4 \times 2 \]
Consider another convolution with 5 filters. The resulting size becomes:
\[ 1000 \times 4 \times 4 \times 5 \]
With max pooling again, we get:
\[ 1000 \times 2 \times 2 \times 5 \]
Once we have used enough combinations of pooling and convolution, the array can then be unravelled to form a dataset of size
\[ 1000 \times 20 \]
This data is small relative to the input and has already learned localized features. A dense neural network can then use it to produce predicted probabilities.
Now, we will apply convolutions and max pooling the context of an actual neural network. This is exactly the same as described in our small example above, however the values of the weights in the kernels are learned from the data rather than being pre-determined. This means that they have the power to detect patterns we would not have thought of, but it also comes at the cost of not longer being able to describe exactly what each filter is doing.
We start by working with the MNIST dataset again. We need the data to remain in its array-format, so we will not collapse it into a matrix this time.
library(keras)
mnist <- read_csv("https://statsmaths.github.io/ml_data/mnist_10.csv")
X <- read_rds("image_data/mnist_10_x28.rds")
y <- mnist$class
X_train <- X[mnist$train_id == "train",,,,drop = FALSE]
y_train <- to_categorical(y[mnist$train_id == "train"], num_classes = 10)
We now build a simple convolution neural network, with just one convolutional layer with 16 filters, followed by max pooling. Typically, unlike our example above, we use square kernels. Most often these are of size 3-by-3, though 5-by-5 and even 1-by-1 are seen in certain architectures. Note that we need to fully describe the correct dimension of X_train. We also need a layer called layer_flatten when going from the convolutional part of the network to the dense part.
model <- keras_model_sequential()
model %>%
layer_conv_2d(filters = 16, kernel_size = c(3, 3),
input_shape = dim(X_train)[-1],
padding = "same") %>%
layer_max_pooling_2d(pool_size = c(2, 2)) %>%
layer_activation(activation = "relu") %>%
layer_flatten() %>%
layer_dense(units = 10) %>%
layer_activation(activation = "softmax")
model %>% compile(loss = 'categorical_crossentropy',
optimizer = optimizer_sgd(lr = 0.01, momentum = 0.8),
metrics = c('accuracy'))
model
## Model
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## conv2d_1 (Conv2D) (None, 28, 28, 16) 160
## ___________________________________________________________________________
## max_pooling2d_1 (MaxPooling2D) (None, 14, 14, 16) 0
## ___________________________________________________________________________
## activation_1 (Activation) (None, 14, 14, 16) 0
## ___________________________________________________________________________
## flatten_1 (Flatten) (None, 3136) 0
## ___________________________________________________________________________
## dense_1 (Dense) (None, 10) 31370
## ___________________________________________________________________________
## activation_2 (Activation) (None, 10) 0
## ===========================================================================
## Total params: 31,530
## Trainable params: 31,530
## Non-trainable params: 0
## ___________________________________________________________________________
Setting padding to “same” is what maxes it so that the first layer outputs images of size 28-by-28.
history <- model %>% fit(X_train, y_train, epochs = 10,
validation_split = 0.1)
plot(history)
Notice that this model does not have many weight compared to the neural networks we used last time. It takes a fairly long time to run, though, given its size. The reason is that convolutions generally do not have a large weights because for each filter we are only learning 2-by-2-by-(num prior filters) values, unlike the dense layers that connect everything in one layer to everything in the prior layer. Computing the gradient and the inputs to the next layer, however, take a long time because we have to apply each kernel to the entire image.
Much as we did with the weights in the dense neural networks, we can visualize the kernels learned by the neural network. To start, make sure that the sizes of the weights in the convolutional layer make sense:
layer <- get_layer(model, index = 1)
dim(layer$get_weights()[[1]])
## [1] 3 3 1 16
dim(layer$get_weights()[[2]])
## [1] 16
The 16 in each refers to the specific kernel, with all 16 put into a single 4-dimensional array.
par(mar = c(0,0,0,0))
par(mfrow = c(4, 4))
for (i in 1:16) {
wg <- layer$get_weights()[[1]][,,,i]
im <- abs(wg) / max(abs(wg))
plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
rasterImage(im,0,0,1,1,interpolate=FALSE)
box()
}
There is not much we can directly do with these, but it is good to see them in order to check whether you understand what these convolutions are doing to the input image.